Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

[PHP-DEV] PHP taint support: first results

6 views
Skip to first unread message

Wietse Venema

unread,
Oct 2, 2007, 8:41:35 PM10/2/07
to
A while ago I posted a proposal to add support for tainted variables
to PHP, to alert programmers at run-time when they make the common
mistake of using uncleansed input with include, echo, system, open,
etc. A copy of the original post can be found on-line, for example,
at http://marc.info/?l=php-internals&m=116621380305497&w=2

After working on this over the summer in bursts, I now have a first
implementation that adds taint support to the core engine and to
a selection of built-ins and extensions. Not surprisingly, the
initial plan needed to be updated in the light of experience. The
good news is that performance is better than I hoped it would be.

A quick example
---------------

To give an idea of the functionality, consider the following program
with an obvious HTML injection bug:

<?php
$username = $_GET['username'];
echo "Welcome back, $username\n";
?>

With default .ini settings, this program does exactly what the
programmer wrote: it echos the contents of the username request
attribute, including all the malicious HTML code that an attacker
may have supplied along with it.

When I change one .ini setting:

taint_error_level = E_WARNING

the program produces the same output, but it also produces a warning:

Warning: echo(): Argument contains data that is not converted
with htmlspecialchars() or htmlentities() in /path/to/script
on line 3

Changing E_WARNING into E_ERROR causes execution to terminate, as
one would expect. These and other settings can appear in php.ini,
on the PHP command line, or they can be made with ini_set() calls
in the application itself.

Current status
--------------

My plan is to release something in the next few days, and to go
through a several iterations while adjusting the course based on
the feedback I receive.

The performance is good: the overhead for "make test" is within the
measurement error of 1-2% (when comparing user-mode CPU time). I
do realize that a portion of the total "make test" time is spent
in non-PHP processing, but with the current numbers it doesn't
really matter if taint overhead is 1% or 2%. If someone has a
better "macro" benchmark, then of course I'm interested.

Although the code is still green, its present form already shows
potential to do things like labeling privacy sensitive data (social
security or credit card numbers) and preventing such data from
flying out the window via HTML web pages.


Multiple flavors of taint
-------------------------

As many know, a PHP program should use htmlspecialchars() or
htmlentities() before rendering data as HTML output. And before
rendering data in other "output" like contexts, a program should
use the proper conversion function, too: escapeshellcmd() etc.
for shell commands, mysqli_real_escape_string() for mysqli queries.
These conversions are not only needed for security reasons, they
are also required for robustness; shell or SQL commands may fail
unexpectedly when given a legitimate name such as O'Reilly.

What I have implemented encourages programmers to do exactly that:
using the RIGHT conversion function. To achieve this goal, the code
implements multiple flavors of taint.

In my initial post I proposed black-and-white taint, but I changed
my mind for human interface reasons. To help programmers eliminate
security holes from their code, they need to be told how to fix the
problem. With black-and-white taint, the system just didn't know
if the programmer had chosen the right conversion function. The
tool was too simplistic for PHP's complex environment.

Low-level implementation
------------------------

Taint support is implemented with some of the unused bits in the
zval data structure. The zval is the PHP equivalent of a memory
cell. Besides a type (string, number etc.) and value, each zval has
a reference count and a flag that says whether the zval is a reference
to yet another zval that contains the actual value.

Right now I am using seven bits, but there is room for more: 32-bit
UNIX compilers such as GCC add 16 bits of padding to the current
zval data structure, and this amount isn't going to be smaller on
64-bit architectures. If I really have to squeeze the taint bits
in-between the existing bits, the taint support performance hit
goes up to perhaps 2% in the macro benchmarks that I mentioned
above (but again, this number is barely above the 1-2% measurement
error, so don't take it too literally).

The preliminary configuration user interface is rather low-level,
somewhat like MS-DOS file permissions :-( This is good enough for
testing and debugging the taint support itself, but I would not
want to have wires hanging out of the machine like this forever.
The raw bits will need to be encapsulated so that applications can
work with meaningful names and abstractions.

To give an idea of what the nuts and bolts look like, this is the
preliminary list of bits, or should I say: binary properties,
together with the parameters that control their handling:

- TC_HTML:
- By default, data with this bit is not allowed in HTML output
(parameter: taint_checks_html). This requirement is not
enforced with the default setting of taint_error_level = 0.
- The htmlspecialchars() and htmlentities() functions produce
output without this bit.
- By default, this bit is set on all data from the web (parameter:
taint_marks_egpcs), from DBMS (parameter: taint_marks_dbms)
or from elsewhere (parameter: taint_marks_other).

- TC_SHELL:
- By default, data with this bit is not allowed in shell commands
(parameter: taint_checks_shell). This requirement is not
enforced with the default setting of taint_error_level = 0.
- The escapeshellarg() and escapeshellcmd() functions produce
output without this bit.
- By default, this bit is set on all data from the web (parameter:
taint_marks_egpcs), from DBMS (parameter: taint_marks_dbms)
or elsewhere (parameter: taint_marks_other).

- TC_MYSQL:
- By default, data with this bit is not allowed in mysql_query()
(parameter: taint_checks_mysql). This requirement is not
enforced with the default setting of taint_error_level = 0.
- The mysql_real_escape_string() function produces output without
this bit.
- By default, this bit is set on all data from the web (parameter:
taint_marks_egpcs), from DBMS (parameter: taint_marks_dbms) or
elsewhere (parameter: taint_marks_other).

- TC_MYSQLI:
- Like TC_MYSQL, except this one is for the mysqli_query() API.
The TC_MYSQLI bit is removed with mysqli_real_escape_string().

- TC_SELF:
- By default, data with this bit cannot be used for internal
control operations such as eval, include, as file name argument,
as network destination, or in other contexts where someone could
take away control from the application (parameter:
taint_checks_self). This requirement disappears with
taint_error_level = 0.
- I haven't yet implemented a dedicated conversion function;
to shut up the warnings, this data needs to be marked as
"safe" with a low-level untaint($foo, TC_SELF) call.
- By default, this bit is set on all data from the web (parameter:
taint_marks_egpcs).

- TC_USER1, TC_USER2: These are labels that an application can set
on specific data. For example, it could set these bits when credit
card or social security numbers come out of a database. The
taint_checks_html policy for HTML output (see above) would then
be configured to disallow data with not only with the TC_HTML
property, but also with TC_USER1 or TC_USER2. Obviously some
polished user interface would need to be built on top of this to
make application-defined attributes usable.

Taint propagation policy
------------------------

Before implementing the above policies, the first order of business
was adding taint propagation to the PHP core: for each operator,
including type conversion, a decision had to made how to propagate
taint from source operands to results.

The general taint propagation rules are:

- Pure arithmetic and pure string operations propagate all the taint
bits from their operands to their results. The rules become more
complicated with operators whose operands have different types.

- Conversions from string to number or to boolean remove all but a
few taint bits (by default, only the TC_SELF bit stays).

This prevents silly warnings about having to use htmlspecialchars()
or mysql_real_escape_string() when rendering non-string data in
SQL/HTML/shell context, while still protecting the application
against some control hijacking attacks.

- Conversions from number or boolean to string preserve all the
taint bits.

- Comparison operators currently ignore taint bits.

Most of this taint propagation is done, but there are a few minor
issues that still need to be resolved.

- Something needs to be done when functions like parse_str() are
given tainted data: the question is how to represent the taintedness
of the resulting hash table lookup keys. These strings could be
harmful when used in file names, in database names, or in other
critical information.

- Something needs to be done about null strings; it would be silly
to insist that data composed from a null string be converted with
htmlspecialchars() etc. On the other hand, a null string does
change the syntactical structure of information, so we have to
be careful.

While adding taint propagation I found that a lot of PHP source
code fails to use the official macros when initializing a zval. In
these cases I added another line of code to initialize the taint
bits by hand. Also, more internal documentation (other than empty
man page skeletons) could have reduced development time significantly.

Major loose ends
----------------

I already mentioned the loose wires hanging out of the machine; the
user interface for taint policy control will need to be made more
suitable for people who aren't primarily interested in PHP core
hacking.

Support for tainted objects is still incomplete. In particular,
conversions from objects to non-objects may lose taint bits.

For now, I manually added taint support to a number of standard
built-ins (file, process, *scanf, *printf, and a subset of the
string functions) and extensions (mysql, mysli). I hope this will
be sufficient to get some experience with taint support.

Right now, taint-unaware extensions will work properly as long as
taint checks are disabled (the default), and as long as they are
recompiled with the patched PHP header files. When taint checking
is turned on, some extensions may cause false alarms when they fail
to use the official macros to initialize zval structures, thereby
leaving some taint bits at uninitialized values.

I still hope that it will somehow be possible to annotate extensions
so that taint support can be added without modifying lots of extension
source code. However, having multiple flavors of taint, instead of
just one, will make the job so much more interesting.

Distant future
--------------

Currently, only data is labeled (currently, only with binary
attributes). No corresponding attributes exist for sources and sinks
(files, network, databases, etc.). If we knew that a connection
is encrypted, or whether something is an intranet or extranet
destination, or some other property, then we could implement more
sophisticated policies than the simple MS-DOS like file permissions
that I have implemented now.

But all this is miles beyond the immediate problem that I am trying
to solve today: helping programmers find the holes in their own code
before other people do.

Wietse

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php

David Wang

unread,
Oct 2, 2007, 9:37:15 PM10/2/07
to
On 10/2/07, Wietse Venema <wie...@porcupine.org> wrote:
> Low-level implementation
> ------------------------
>
> Taint support is implemented with some of the unused bits in the
> zval data structure. The zval is the PHP equivalent of a memory
> cell. Besides a type (string, number etc.) and value, each zval has
> a reference count and a flag that says whether the zval is a reference
> to yet another zval that contains the actual value.

One possible problem area is that I'm using some of those very same
bits for storing GC information. I'm not using very many, so there
should be plenty to go around, but I'm currently using 3 bits in
is_ref. 2 for the "color" (the GC state of the object) and 1 for
whether or not the object is internally buffered. Just a heads up.

Some details: I have the leftmost bit in is_ref be the indicator of
is_ref. This is so I can test is_ref by just testing (z->is_ref >=
0x80). The reason I'm doing it like this is because I find that this
comparison test is a lot faster than doing a bitwise test. I then use
the other top three bits for what I need, so the bottom 4 bits are
free.

We really should conserve space, though. The biggest problem with
increasing the size of the zval struct seems to be (incredibly) L1
cache misses, as verified with cachegrind. Even increasing the size of
the current zval struct by one byte has a measurable impact (my goal
was to keep Zend/bench.php from showing any hit at all).

David

Wietse Venema

unread,
Oct 2, 2007, 10:39:32 PM10/2/07
to
David Wang:
[ Charset ISO-8859-1 unsupported, converting... ]

> On 10/2/07, Wietse Venema <wie...@porcupine.org> wrote:
> > Low-level implementation
> > ------------------------
> >
> > Taint support is implemented with some of the unused bits in the
> > zval data structure. The zval is the PHP equivalent of a memory
> > cell. Besides a type (string, number etc.) and value, each zval has
> > a reference count and a flag that says whether the zval is a reference
> > to yet another zval that contains the actual value.
>
> One possible problem area is that I'm using some of those very same
> bits for storing GC information. I'm not using very many, so there
> should be plenty to go around, but I'm currently using 3 bits in
> is_ref. 2 for the "color" (the GC state of the object) and 1 for
> whether or not the object is internally buffered. Just a heads up.

Did you read this?

Right now I am using seven bits, but there is room for more: 32-bit
UNIX compilers such as GCC add 16 bits of padding to the current
zval data structure, and this amount isn't going to be smaller on
64-bit architectures. If I really have to squeeze the taint bits
in-between the existing bits, the taint support performance hit
goes up to perhaps 2% in the macro benchmarks that I mentioned
above (but again, this number is barely above the 1-2% measurement
error, so don't take it too literally).

I found micro benchmarks in particular to be very processor dependent
(examples: empty loop, loop adding numbers, loop concatenating
strings; some tests were faster with taint, and some slower; change
the processor and some differences were the other way around). This
was with the same OS and the same compiler. Obviously such
differences are meaningless, and this is why I looked at macro
benchmarks.

Wietse

Marcus Boerger

unread,
Oct 3, 2007, 11:37:13 AM10/3/07
to
Hello Wietse,

Wednesday, October 3, 2007, 2:35:33 AM, you wrote:


> Changing E_WARNING into E_ERROR causes execution to terminate, as
> one would expect. These and other settings can appear in php.ini,
> on the PHP command line, or they can be made with ini_set() calls
> in the application itself.

Sounds like a pretty good solution to me.

> Current status
> --------------

> My plan is to release something in the next few days, and to go
> through a several iterations while adjusting the course based on
> the feedback I receive.

> The performance is good: the overhead for "make test" is within the
> measurement error of 1-2% (when comparing user-mode CPU time). I
> do realize that a portion of the total "make test" time is spent
> in non-PHP processing, but with the current numbers it doesn't
> really matter if taint overhead is 1% or 2%. If someone has a
> better "macro" benchmark, then of course I'm interested.

> Although the code is still green, its present form already shows
> potential to do things like labeling privacy sensitive data (social
> security or credit card numbers) and preventing such data from
> flying out the window via HTML web pages.

~2% sounds even better to me.

> Low-level implementation
> ------------------------


> - TC_MYSQL:
> - By default, data with this bit is not allowed in mysql_query()
> (parameter: taint_checks_mysql). This requirement is not
> enforced with the default setting of taint_error_level = 0.
> - The mysql_real_escape_string() function produces output without
> this bit.
> - By default, this bit is set on all data from the web (parameter:
> taint_marks_egpcs), from DBMS (parameter: taint_marks_dbms) or
> elsewhere (parameter: taint_marks_other).

> - TC_MYSQLI:
> - Like TC_MYSQL, except this one is for the mysqli_query() API.
> The TC_MYSQLI bit is removed with mysqli_real_escape_string().

Couldn't we do a general SQL thing here? Or at leat a small set of different
SQL behaviors to capture all databases we support? However we should at
least have a verison that works for PDO/

> - TC_SELF:
> - By default, data with this bit cannot be used for internal
> control operations such as eval, include, as file name argument,
> as network destination, or in other contexts where someone could
> take away control from the application (parameter:
> taint_checks_self). This requirement disappears with
> taint_error_level = 0.
> - I haven't yet implemented a dedicated conversion function;
> to shut up the warnings, this data needs to be marked as
> "safe" with a low-level untaint($foo, TC_SELF) call.
> - By default, this bit is set on all data from the web (parameter:
> taint_marks_egpcs).

To me this sounds more like TC_PHP, TC_SELF just doesn't sounds right.

> - TC_USER1, TC_USER2: These are labels that an application can set
> on specific data. For example, it could set these bits when credit
> card or social security numbers come out of a database. The
> taint_checks_html policy for HTML output (see above) would then
> be configured to disallow data with not only with the TC_HTML
> property, but also with TC_USER1 or TC_USER2. Obviously some
> polished user interface would need to be built on top of this to
> make application-defined attributes usable.

> Taint propagation policy
> ------------------------

> While adding taint propagation I found that a lot of PHP source


> code fails to use the official macros when initializing a zval. In
> these cases I added another line of code to initialize the taint
> bits by hand. Also, more internal documentation (other than empty
> man page skeletons) could have reduced development time significantly.

Right. The same is the problem with the GC patches. Once again I think we
should disallow all direct access to zval members by inserting different
prefixes to zval members. That way wrong usage would make all other compiles
fail. So you'd just need to check whether debug and non debug works and be
safe. Another approach is to hide the actual zval struct and only provide
access functions. But that is slower.

Best regards,
Marcus

laurent jouanneau

unread,
Oct 5, 2007, 7:37:25 AM10/5/07
to
(Wietse Venema) wrote:
> To give an idea of the functionality, consider the following program
> with an obvious HTML injection bug:
>
> <?php
> $username = $_GET['username'];
> echo "Welcome back, $username\n";
> ?>
>
> With default .ini settings, this program does exactly what the
> programmer wrote: it echos the contents of the username request
> attribute, including all the malicious HTML code that an attacker
> may have supplied along with it.
>
> When I change one .ini setting:
>
> taint_error_level = E_WARNING
>
> the program produces the same output, but it also produces a warning:
>
> Warning: echo(): Argument contains data that is not converted
> with htmlspecialchars() or htmlentities() in /path/to/script
> on line 3

A PHP application doesn't always generate HTML : it can generate JSON,
CSV, PDF etc.. In this case, we don't have to call htmlspecialchars etc..

Is this warning appearing also when you want to output datas other than
HTML ? If no, how your code guess the output type ? If yes, how can we
disable this warning in pages which produce JSON etc. ?

Laurent

---
http://jelix.org

Wietse Venema

unread,
Oct 5, 2007, 11:59:21 AM10/5/07
to
laurent jouanneau:

> (Wietse Venema) wrote:
> > To give an idea of the functionality, consider the following program
> > with an obvious HTML injection bug:
> >
> > <?php
> > $username = $_GET['username'];
> > echo "Welcome back, $username\n";
> > ?>
> >
> > With default .ini settings, this program does exactly what the
> > programmer wrote: it echos the contents of the username request
> > attribute, including all the malicious HTML code that an attacker
> > may have supplied along with it.
> >
> > When I change one .ini setting:
> >
> > taint_error_level = E_WARNING
> >
> > the program produces the same output, but it also produces a warning:
> >
> > Warning: echo(): Argument contains data that is not converted
> > with htmlspecialchars() or htmlentities() in /path/to/script
> > on line 3
>
> A PHP application doesn't always generate HTML : it can generate JSON,
> CSV, PDF etc.. In this case, we don't have to call htmlspecialchars etc..

In that case, I suppose you would not be using echo, so there
is no problem.

Wietse

> Is this warning appearing also when you want to output datas other than
> HTML ? If no, how your code guess the output type ? If yes, how can we
> disable this warning in pages which produce JSON etc. ?

--

M. Sokolewicz

unread,
Oct 5, 2007, 12:52:21 PM10/5/07
to
(Wietse Venema) wrote:
> laurent jouanneau:
>> (Wietse Venema) wrote:
>>> To give an idea of the functionality, consider the following program
>>> with an obvious HTML injection bug:
>>>
>>> <?php
>>> $username = $_GET['username'];
>>> echo "Welcome back, $username\n";
>>> ?>
>>>
>>> With default .ini settings, this program does exactly what the
>>> programmer wrote: it echos the contents of the username request
>>> attribute, including all the malicious HTML code that an attacker
>>> may have supplied along with it.
>>>
>>> When I change one .ini setting:
>>>
>>> taint_error_level = E_WARNING
>>>
>>> the program produces the same output, but it also produces a warning:
>>>
>>> Warning: echo(): Argument contains data that is not converted
>>> with htmlspecialchars() or htmlentities() in /path/to/script
>>> on line 3
>> A PHP application doesn't always generate HTML : it can generate JSON,
>> CSV, PDF etc.. In this case, we don't have to call htmlspecialchars etc..
>
> In that case, I suppose you would not be using echo, so there
> is no problem.
>
You wouldn't? So, when outputting a script-generated pdf file, how would
you do that if not using echo? (and thus also not print since that's
pretty much the exact same thing)

Wietse Venema

unread,
Oct 5, 2007, 1:17:07 PM10/5/07
to
M. Sokolewicz:

Never mind.

The code that creates PDF produces data that carries no taint
labels. The taint labels don't jump into existence spontaneously,
they have to be added by whoever creates that data.

Thus, because PDF is created without taint labels, the echo operator
will not object to echoing it.

> >> Is this warning appearing also when you want to output datas other than
> >> HTML ? If no, how your code guess the output type ? If yes, how can we
> >> disable this warning in pages which produce JSON etc. ?

Guessing is something that we do in inferior software.

Data carries taint labels that say what can't be done with it. If
the PDF creator etc. does not label its output, then there are no
restrictions.

Wietse

Rasmus Lerdorf

unread,
Oct 5, 2007, 2:21:51 PM10/5/07
to

This doesn't make much sense to me.

Consider very common (abbreviated) code like this:

$user_data = $_REQUEST['data'];
switch($output_format) {
case 'html':
echo "<html>$user_data</html>";
break;
case 'xml':
header('Content-type: text/xml');
echo "<xml>$user_data</xml>";
break;
case 'json':
header('Content-type: application/json');
echo json_encode(array($user_data));
break;
}

$user_data is going to be tainted, but the untainting rules are very
different for those 3 cases and popping up an error that talks about
html escaping only makes sense in the html case. That's part of what I
was talking about months ago when I talked about the problem with
context-less tainting.

-Rasmus

Wietse Venema

unread,
Oct 5, 2007, 2:31:33 PM10/5/07
to
Rasmus Lerdorf:

> Consider very common (abbreviated) code like this:
>
> $user_data = $_REQUEST['data'];
> switch($output_format) {

Question: where is the output format feature documented?

Once I know the output format is not HTML, then I know
that applying HTML-style restrictions is not appropriate.

I did search around but came up empty handed.

Wietse

Rasmus Lerdorf

unread,
Oct 5, 2007, 2:35:26 PM10/5/07
to
Wietse Venema wrote:
> Rasmus Lerdorf:
>> Consider very common (abbreviated) code like this:
>>
>> $user_data = $_REQUEST['data'];
>> switch($output_format) {
>
> Question: where is the output format feature documented?
>
> Once I know the output format is not HTML, then I know
> that applying HTML-style restrictions is not appropriate.
>
> I did search around but came up empty handed.

That's just an example. People building REST-based web services will
typically include an output format parameter. Or they determine the
output format based on other criteria. It's completely up to the script
author which output format he wants to use and how he wants to make that
decision.

-Rasmus

Wietse Venema

unread,
Oct 5, 2007, 2:47:24 PM10/5/07
to
Rasmus Lerdorf:

> Wietse Venema wrote:
> > Rasmus Lerdorf:
> >> Consider very common (abbreviated) code like this:
> >>
> >> $user_data = $_REQUEST['data'];
> >> switch($output_format) {
> >
> > Question: where is the output format feature documented?
> >
> > Once I know the output format is not HTML, then I know
> > that applying HTML-style restrictions is not appropriate.
> >
> > I did search around but came up empty handed.
>
> That's just an example. People building REST-based web services will
> typically include an output format parameter. Or they determine the
> output format based on other criteria. It's completely up to the script
> author which output format he wants to use and how he wants to make that
> decision.

Would perhaps the Content-Type: information be stored somewhere in
HTTP response, or is this hopeless because of multi-part documents?

If Content-Type: header information were available, then I could use
that to choose the proper rule.

As I mentioned, I don't want to get involved with guessing games.

Wietse

Rasmus Lerdorf

unread,
Oct 5, 2007, 2:52:00 PM10/5/07
to
Wietse Venema wrote:
> Rasmus Lerdorf:
>> Wietse Venema wrote:
>>> Rasmus Lerdorf:
>>>> Consider very common (abbreviated) code like this:
>>>>
>>>> $user_data = $_REQUEST['data'];
>>>> switch($output_format) {
>>> Question: where is the output format feature documented?
>>>
>>> Once I know the output format is not HTML, then I know
>>> that applying HTML-style restrictions is not appropriate.
>>>
>>> I did search around but came up empty handed.
>> That's just an example. People building REST-based web services will
>> typically include an output format parameter. Or they determine the
>> output format based on other criteria. It's completely up to the script
>> author which output format he wants to use and how he wants to make that
>> decision.
>
> Would perhaps the Content-Type: information be stored somewhere in
> HTTP response, or is this hopeless because of multi-part documents?
>
> If Content-Type: header information were available, then I could use
> that to choose the proper rule.
>
> As I mentioned, I don't want to get involved with guessing games.

Yes, you can check the content-type at the time of the echo and whatever
the value is is what will go out. Unless of course output buffering is
turned on, in which case the echo doesn't actually output anything until
the buffer is flushed, and that would be the point at which you would
need to check it. The buffer can also be tossed, so the buffered echo
may not ever go out.

-Rasmus

Stut

unread,
Oct 5, 2007, 2:58:10 PM10/5/07
to
Wietse Venema wrote:
> Rasmus Lerdorf:
>> Wietse Venema wrote:
>>> Rasmus Lerdorf:
>>>> Consider very common (abbreviated) code like this:
>>>>
>>>> $user_data = $_REQUEST['data'];
>>>> switch($output_format) {
>>> Question: where is the output format feature documented?
>>>
>>> Once I know the output format is not HTML, then I know
>>> that applying HTML-style restrictions is not appropriate.
>>>
>>> I did search around but came up empty handed.
>> That's just an example. People building REST-based web services will
>> typically include an output format parameter. Or they determine the
>> output format based on other criteria. It's completely up to the script
>> author which output format he wants to use and how he wants to make that
>> decision.
>
> Would perhaps the Content-Type: information be stored somewhere in
> HTTP response, or is this hopeless because of multi-part documents?
>
> If Content-Type: header information were available, then I could use
> that to choose the proper rule.
>
> As I mentioned, I don't want to get involved with guessing games.

In that case why not assume HTML but provide a way to specify otherwise.
It could be a specific function call or using ini_set.

I don't think it's unreasonable to require scripts outputting content
other than HTML to include a line that modifies the default behaviour.
Surely the benefits far outweigh that cost.

-Stut

--
http://stut.net/

Wietse Venema

unread,
Oct 5, 2007, 3:01:47 PM10/5/07
to
Wietse Venema:

> Rasmus Lerdorf:
> > Wietse Venema wrote:
> > > Rasmus Lerdorf:
> > >> Consider very common (abbreviated) code like this:
> > >>
> > >> $user_data = $_REQUEST['data'];
> > >> switch($output_format) {
> > >
> > > Question: where is the output format feature documented?
> > >
> > > Once I know the output format is not HTML, then I know
> > > that applying HTML-style restrictions is not appropriate.
> > >
> > > I did search around but came up empty handed.
> >
> > That's just an example. People building REST-based web services will
> > typically include an output format parameter. Or they determine the
> > output format based on other criteria. It's completely up to the script
> > author which output format he wants to use and how he wants to make that
> > decision.
>
> Would perhaps the Content-Type: information be stored somewhere in
> HTTP response, or is this hopeless because of multi-part documents?

As shown in Rasmus's example, when the information is supplied with
a header("Content-Type: foo") call before the output is echo-ed,
I can determine the MIME type. I just need to hook into the header()
function and do a little parsing.

Thanks for clarifying this mystery.

Wietse

Rasmus Lerdorf

unread,
Oct 5, 2007, 3:07:52 PM10/5/07
to
Stut wrote:
> Wietse Venema wrote:
>> Rasmus Lerdorf:
>>> Wietse Venema wrote:
>>>> Rasmus Lerdorf:
>>>>> Consider very common (abbreviated) code like this:
>>>>>
>>>>> $user_data = $_REQUEST['data'];
>>>>> switch($output_format) {
>>>> Question: where is the output format feature documented?
>>>>
>>>> Once I know the output format is not HTML, then I know
>>>> that applying HTML-style restrictions is not appropriate.
>>>>
>>>> I did search around but came up empty handed.
>>> That's just an example. People building REST-based web services will
>>> typically include an output format parameter. Or they determine the
>>> output format based on other criteria. It's completely up to the script
>>> author which output format he wants to use and how he wants to make that
>>> decision.
>>
>> Would perhaps the Content-Type: information be stored somewhere in
>> HTTP response, or is this hopeless because of multi-part documents?
>>
>> If Content-Type: header information were available, then I could use
>> that to choose the proper rule.
>>
>> As I mentioned, I don't want to get involved with guessing games.
>
> In that case why not assume HTML but provide a way to specify otherwise.
> It could be a specific function call or using ini_set.
>
> I don't think it's unreasonable to require scripts outputting content
> other than HTML to include a line that modifies the default behaviour.
> Surely the benefits far outweigh that cost.

That's already there. They set the content-type. The problem becomes
when they set it vs. when output goes out. It's also very common to
turn on output buffering and buffer a bunch of stuff and then set the
content-type just before flushing the buffer.

-Rasmus

Stut

unread,
Oct 5, 2007, 3:10:41 PM10/5/07
to

Maybe it's enough for the tainting to switch context when a new content
type is set. I'm trying to think of a situation where you might not be
able to determine the requested output format because the tainting gets
in the way, but I can't come up with anything. However, it would likely
break a number of existing scripts which is not ideal.

-Stut

--
http://stut.net/

Wietse Venema

unread,
Oct 5, 2007, 3:20:10 PM10/5/07
to
Rasmus Lerdorf:

> > I don't think it's unreasonable to require scripts outputting content
> > other than HTML to include a line that modifies the default behaviour.
> > Surely the benefits far outweigh that cost.
>
> That's already there. They set the content-type. The problem becomes
> when they set it vs. when output goes out. It's also very common to
> turn on output buffering and buffer a bunch of stuff and then set the
> content-type just before flushing the buffer.

That practice would be incompatible with taint checks. I suppose
having to specify the type first is not a burden for the application,
because the application knows what output it is going to produce,
before it produces it.

Taint policy is applied to echo (and print, etc.) arguments. It
can't be applied to the contents of the output buffer, because that
would be prohibitively expensive. One would need to record that
byte A came from function B argument C at file D line E, and that
byte A had a particular set of taint labels.

Wietse

Wietse Venema

unread,
Oct 5, 2007, 3:21:34 PM10/5/07
to
Stut:

> > That's already there. They set the content-type. The problem becomes
> > when they set it vs. when output goes out. It's also very common to
> > turn on output buffering and buffer a bunch of stuff and then set the
> > content-type just before flushing the buffer.
>
> Maybe it's enough for the tainting to switch context when a new content
> type is set. I'm trying to think of a situation where you might not be
> able to determine the requested output format because the tainting gets
> in the way, but I can't come up with anything. However, it would likely
> break a number of existing scripts which is not ideal.

This would not break any scripts. Taint support must not be turned
on unless the application was written to support it.

0 new messages